Problem Statement

We have been tasked with developing an update on Monkeypox (MPX) for the leadership of our state health department. MPX is a rare disease caused by infection with the monkeypox virus. As there has been a recent MPX outbreak in Europe, we will be analyzing their data to prepare a response for an outbreak here.

We will be investigating how case rates changed over time in four different regions in Europe. We will also look at certain demographics and MPX rates by country. We will look at whether gender (male and female) affects case rates, as well as age (people 85 and older, and people younger than 85).

Methods

Visualizations

Results

Discussion

Milestone 1

Github link: https://github.com/r-public-health-group/monkey-pox

Team roles and responsibilities: https://github.com/r-public-health-group/monkey-pox/blob/main/team_roles_and_responsibilities.docx

Milestone 2

What is the data source?

We are interested in how monkeypox case rates differ by European region and population demographics. We are first utilizing a monkeypox case dataset from the European Centre for Disease Prevention and Control that spans 2022-05-09 to 2022-08-23 and includes the daily number of confirmed monkey pox cases by country and how the data was collected. Then, we will incorporate a European population dataset from , which includes European countries’ yearly population data from 2011 to 2022 based on the total population residing in that country as of January 1st of each year. Finally, we will incorporate the European Statistical System’s 2011 European census data which includes the EU country codes, sex at birth, age ranges, employment status (CAS), education status, population of locality of residence and the number of people in each strata.

How does the dataset relate to the group problem statement and question?

As we prepare our report on monkeypox case rates to the state health department, the monkeypox case data will provide us with the number of cases that will be the numerator of our case rate calculation, the population data gives us the size of the population at risk that will be the denominator of our case rate calculation, and the census data will allow us to stratify the case rates by various demographic factors.

Import Statement

All three of our CSVs are hosted on the course github page: https://github.com/PHW290/phw251_projectdata. We navigated to our respective csvs, clicked on raw data to get the standalone csv, and saved it into a url variable in this Rmd. We then used the read_csv function from the readr library to import our data into rstudio.

mpx_cases_url = "https://raw.githubusercontent.com/PHW290/phw251_projectdata/main/euro_mpx_cases.csv"
mpx_cases_df <- read_csv(mpx_cases_url)

pop_denoms_url = "https://raw.githubusercontent.com/PHW290/phw251_projectdata/main/euro_pop_denominators.csv"
pop_denoms_df = read_csv(pop_denoms_url)

census_stats_url = "https://raw.githubusercontent.com/PHW290/phw251_projectdata/main/euro_census_stats.csv"
census_stats_df <- read_csv(census_stats_url,
          col_names = TRUE,
          col_types = NULL,
          na = c("", "NA"))

Lowercase Column Names

We are now lowercasing all the column names.

mpx_cases_df <- rename_with(mpx_cases_df, ~tolower(.x))
pop_denoms_df <- rename_with(pop_denoms_df, ~tolower(.x))
census_stats_df <- rename_with(census_stats_df, ~tolower(.x))

Data Types and Descriptions

Details of key data elements are outlined below. All elements listed are in an appropriate data format for the joins and analysis we intend to do.

Variable: daterep from mpx_cases_df

str(mpx_cases_df$daterep)
##  Date[1:2987], format: "2022-05-09" "2022-05-09" "2022-05-09" "2022-05-09" "2022-05-09" ...
min(mpx_cases_df$daterep)
## [1] "2022-05-09"
max(mpx_cases_df$daterep)
## [1] "2022-08-23"
unique(dplyr::count(mpx_cases_df, daterep)$n)
## [1] 29

The dates are in date format, they span from 2022-05-09 to 2022-08-23, and each date shows up 29 times.

Variables: countrycode from mpx_cases_df

typeof(mpx_cases_df$countrycode)
## [1] "character"
unique(mpx_cases_df$countrycode)
##  [1] "AT" "BE" "BG" "HR" "CY" "CZ" "DK" "EE" "FI" "FR" "DE" "EL" "HU" "IS" "IE"
## [16] "IT" "LV" "LT" "LU" "MT" "NL" "NO" "PL" "PT" "RO" "SK" "SI" "ES" "SE"
length(unique(mpx_cases_df$countrycode))
## [1] 29

The variable countryexp is the country name where countrycode is it’s two letter counterpart. Both are in character format. There are 29 countries respresented.

Variable: source from mps_cases_df

typeof(mpx_cases_df$source)
## [1] "character"
unique(mpx_cases_df$source)
## [1] "TESSy" "EI"

Source is in a character format. There are two sources of data: “TESSy” and “EI.”

Variable: confcase from mpx_cases_df

typeof(mpx_cases_df$confcases)
## [1] "double"
unique(mpx_cases_df$confcases)
##   [1]   0   1   4   2   3   6   5  20   9  11  82  21  14  13   7  10   8  16
##  [19]  22  26  12  19  15  27  30  33  47  25  31  18  41  34  40  38  61  43
##  [37]  42  72  46  48  66  70  55  56  95  97  53  23  90  76  89  86  24  28
##  [55]  67  17 101 110  74  93  83  58 106  71 130 151  79  68 109 144 184  69
##  [73]  29 176 191  78 158 102 199 112  37 131  44 113 155 160 140 655 136 170
##  [91]  60 172  45 159 388  63 139 114 137 284 147  51  77 121 124 128 178  91
## [109]  73  65
min(mpx_cases_df$confcases)
## [1] 0
mean(mpx_cases_df$confcases)
## [1] 5.715434
max(mpx_cases_df$confcases)
## [1] 655

The confirmed cases are in a numeric data format and range from 0 to 655. The mean is 5.7 cases per day.

Variable: geo from pop_denoms_df

typeof(pop_denoms_df$geo)
## [1] "character"
length(unique(pop_denoms_df$geo))
## [1] 54

The values in the GEO column of the pop_denoms_df are characters denoting the geopolitical region that a given row’s population data comes from. There are 54 possible values for this column, either a 2-letter country code or a code denoting the entire European Union on a given year. We will eventually merge these 2-letter country codess with the 2-letter country codes in the euro_mpx_cases csv, so the character data type will suffice.

Variable: obs_value from pop_denoms_df

summary(pop_denoms_df$obs_value)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     31863   2088384   7040272  44417154  19702267 513206391
typeof(pop_denoms_df$obs_value)
## [1] "double"

The values in the obs_value column are the observed population on January 1 of the reported year. These values are doubles representing the total population count of a given region, so we can use these values directly as the denominators in our case rate calculations.

Variable: country_code from census_stats_df

typeof(census_stats_df$country_code)
## [1] "character"
unique(census_stats_df$country_code)
##  [1] "AT" "BE" "BG" "CH" "CY" "CZ" "DE" "DK" "EE" "EL" "ES" "FI" "FR" "HR" "HU"
## [16] "IE" "IS" "IT" "LI" "LT" "LU" "LV" "MT" "NL" "NO" "PL" "PT" "RO" "SE" "SI"
## [31] "SK" "UK"
length(unique(census_stats_df$country_code))
## [1] 32

Country Code is a character data type and there are 32 different two-letter country codes in this dataset. We will obtain the different regions from the country_code variable.

Variable: sex from census_stats_df

unique(census_stats_df$sex)
## [1] "F" "M"

Sex is a character data type, and there are two categories: male, female. We might use sex as a demographic variable to stratify case rates by.

Variable: age from census_stats_df

typeof(census_stats_df$age)
## [1] "character"
unique(census_stats_df$age)
## [1] "Y_GE85" "Y_LT15" "Y15-29" "Y30-49" "Y50-64" "Y65-84"

Age is character data type that includes six different age ranges in the dataset: <15, 15-29, 30-49, 65-84 & >85. We might use age as a demographic variable to stratify case rates by.

Variable: cas from census_stats_df

typeof(census_stats_df$cas)
## [1] "character"
unique(census_stats_df$cas)
## [1] "ACT"  "EMP"  "INAC" "UNE"  "UNK"

Cas is a character data type and represents the employment status of an individual. There are five different statuses, and we might use these to stratify case rates by.

Variable: edu from census_stats_df

typeof(census_stats_df$edu)
## [1] "character"
unique(census_stats_df$edu)
## [1] "ED1"  "ED2"  "ED3"  "ED4"  "ED5"  "ED6"  "NAP"  "NONE" "UNK"

Education status is in a character format, and there are nine different statuses of education. We might use education status to stratify case rates by.

Variable: res_pop from census_stats_df

typeof(census_stats_df$res_pop)
## [1] "character"
unique(census_stats_df$res_pop)
## [1] "500000-999999" "10000-99999"   "200000-499999" "100000-199999"
## [5] "GE1000000"     "1000-9999"     "0-1000"

Resident population (res_pop) is in a character data format. There are seven levels of population in each locality of residence.

Variable: pop from census_stats_df

typeof(census_stats_df$pop)
## [1] "double"
range(census_stats_df$pop)
## [1]       0 1702270

The strata population is in a numeric data format. The number of people in the different location strata ranges from 0 to 1,702,270.

Milestone 3

Subset rows or columns, as needed

#acquire region dataset
regions <- read.csv('https://raw.githubusercontent.com/PHW290/phw251_projectdata/main/world_country_regions.csv')

#filtered by europe countries
regions <- regions %>%
  filter(region == 'Europe') %>%
  select(iso_3166.2, sub.region, region)

#create country code variable to join tables
regions <- regions %>% 
  mutate(country_code = substr(regions$iso_3166.2,12,13) )

#filtered mpx_cases columns
mpx_cases_df <- mpx_cases_df %>%
  select(daterep, countrycode, confcases)

#filter pop_denom for 2011 for census analysis
pop_denoms_2011 <- pop_denoms_df %>%
  filter(time_period == 2011) %>%
  select(geo, obs_value)

#filtered pop_denom by year and columns
pop_denoms_df <- pop_denoms_df %>%
  filter(time_period == 2022) %>%
  select(geo, obs_value)

#joined all three datasets together
df <- inner_join(x = regions, y = mpx_cases_df, by = c('country_code' = 'countrycode'))
df <- inner_join(x = df, y = pop_denoms_df, by = c('country_code' = 'geo'))


df <- df %>%
  select(-iso_3166.2, -region)

Create new variables needed for analysis (minimum 2)

#created case rate by ten million people
df_grouped <- df %>% 
  group_by(sub.region, daterep) %>%
  summarise(total_cases = sum(confcases), total_obs = sum(obs_value)) %>%
  mutate(case_rate_by_tenmillion = total_cases / total_obs * 10000000)
## `summarise()` has grouped output by 'sub.region'. You can override using the
## `.groups` argument.

Clean variables needed for analysis (minimum 2)

#we used an inner join and there were no discrepancies between datasets, 
#therefore are no nulls caused by joining
sum(is.na(df_grouped))
## [1] 0
#make sub.region a categorical value
df_grouped <- df_grouped %>%
  mutate(sub.region = factor(sub.region))

#make new month column & made it a categorical value
df_grouped <- df_grouped %>%
  mutate(month = strftime(daterep,"%B")) %>%
  mutate(month = factor(month, levels = c('May','June','July','August')))

str(df_grouped)
## grouped_df [412 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ sub.region             : Factor w/ 4 levels "Eastern Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ daterep                : Date[1:412], format: "2022-05-09" "2022-05-13" ...
##  $ total_cases            : num [1:412] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_obs              : num [1:412] 89171711 89171711 89171711 89171711 89171711 ...
##  $ case_rate_by_tenmillion: num [1:412] 0 0 0 0 0 0 0 0 0 0 ...
##  $ month                  : Factor w/ 4 levels "May","June","July",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "groups")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ sub.region: Factor w/ 4 levels "Eastern Europe",..: 1 2 3 4
##   ..$ .rows     : list<int> [1:4] 
##   .. ..$ : int [1:103] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int [1:103] 104 105 106 107 108 109 110 111 112 113 ...
##   .. ..$ : int [1:103] 207 208 209 210 211 212 213 214 215 216 ...
##   .. ..$ : int [1:103] 310 311 312 313 314 315 316 317 318 319 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

Data dictionary based on clean dataset (minimum 4 data elements), including: Fields Variable name Data type Description

sub.region: factor, region in europe

daterep: date, year-month-day by week from may to august 2022

total_cases: num, total monkeypox cases in a subregion by week

total_obs: num, total population by subregion

case_rate_by_tenmillion: num, the rate of monkeypox in a subregion by ten million people

month: date, the month from may to august 2022

One or more tables with descriptive statistics for 4 data element

summary(df_grouped)
##            sub.region     daterep            total_cases      total_obs        
##  Eastern Europe :103   Min.   :2022-05-09   Min.   :  0.0   Min.   : 38749061  
##  Northern Europe:103   1st Qu.:2022-06-07   1st Qu.:  1.0   1st Qu.: 76566048  
##  Southern Europe:103   Median :2022-07-03   Median :  8.0   Median :106223452  
##  Western Europe :103   Mean   :2022-07-02   Mean   : 41.3   Mean   :110280452  
##                        3rd Qu.:2022-07-29   3rd Qu.: 53.0   3rd Qu.:139937856  
##                        Max.   :2022-08-23   Max.   :764.0   Max.   :189925840  
##  case_rate_by_tenmillion    month    
##  Min.   : 0.000          May   : 76  
##  1st Qu.: 0.158          June  :120  
##  Median : 1.009          July  :124  
##  Mean   : 3.048          August: 92  
##  3rd Qu.: 3.894                      
##  Max.   :40.226

Milestone 4

Visualize MPX rates from May to August 2022 in each EU region

As requested by the leadership of our state health department, we have created a bar chart to visualize monthly MPX case rates for the different regions within the EU.

plot_ly(
  df_grouped,
  x= ~month,
  y= ~case_rate_by_tenmillion,
  color= ~sub.region,
  type="bar"
) %>%
  layout(
    title="MPX Case Rates by EU Region, May to August 2022",
    yaxis=list(title="Case Rates per 10,000,000"),
    xais=list(title="Month"),
    paper_bgcolor="white",
    plot_bgcolor="white"
  )

Prepare census data

To evaluate how census data trends for a region compare to that region’s monkeypox rates, we will first calculate the % of the population in each region that is 85 years old or older and the % of the population in each region that is female. Then we will compare these rates against the regions’ monkey pox rates and evaluate any trends we see.

# First, calculate the number of females in each country.
sex_pct_df <- census_stats_df %>%
  group_by(country_code, sex) %>%
  summarize(sex_count = sum(pop)) %>%
  group_by(country_code) %>%
  mutate(country_pop = sum(sex_count)) %>%
  filter(sex == "F",
         country_pop != 0) %>%
  rename(female_count = sex_count)
## `summarise()` has grouped output by 'country_code'. You can override using the
## `.groups` argument.
# Then, calculate the number greater than or equal to 85 years old in each country.
age_pct_df <- census_stats_df %>%
  group_by(country_code, age) %>%
  summarize(age_count = sum(pop)) %>%
  group_by(country_code) %>%
  mutate(country_pop = sum(age_count)) %>%
  filter(age == "Y_GE85", 
         country_pop != 0) %>%
  rename(ge85_count = age_count)
## `summarise()` has grouped output by 'country_code'. You can override using the
## `.groups` argument.
# Join these two df's together.
demog_df <- full_join(x = sex_pct_df, y = age_pct_df, by = "country_code") 
demog_df <- demog_df %>%
  select(country_code, ge85_count, female_count, country_pop.x) %>%
  rename(country_pop = country_pop.x)

# Join the combined df with the regions df so we can map country codes to regions.
##demog_df <- inner_join(x = regions, y = demog_df, by = 'country_code')

# Calculate the % of the population that is GE85 and the % that is Female
demog_df <- demog_df %>%
  group_by(country_code) %>%
  summarize(pct_ge85 = round(sum(ge85_count) / sum(country_pop) * 100,2),
            pct_female = round(sum(female_count) / sum(country_pop) * 100, 2))

# Group the monkeypox case rate df by region and calculate regional case rate
df_case_rate <- df %>%
  group_by(country_code) %>%
  summarize(total_cases = sum(confcases),
            total_obs = sum(obs_value)) %>%
  mutate(case_rate_p_mill = total_cases / total_obs * 1000000)

# Join case rates with the regional census data
demog_df <- inner_join(x = demog_df, y = df_case_rate, by = "country_code")
demog_df <- select(demog_df, c(-total_cases, -total_obs))

Visualize census data vs case rates

From this table view, one can see that there likely isn’t a trend between a country’s percent of the population that is female and its monkeypox rate. However, a positive trend is possible between percent of the population that is 85 and older and that country’s monkeypox case rate, however a visual plot will help elucidate the relationship better.

kable(demog_df,longtable=T,booktabs=T,
      col.names=c("Country Code", "% 85 and Older", "% Female", "Case Rate per Million" ),
      caption="2011 Census Rates versus 2022 Monkeypox Rates by Country") %>%
  kable_styling(full_width=F) %>%
  kable_styling(position="center") %>%
  kable_styling(font_size=10)
2011 Census Rates versus 2022 Monkeypox Rates by Country
Country Code % 85 and Older % Female Case Rate per Million
AT 1.58 49.79 0.2497757
BE 1.57 49.42 0.5600969
BG 1.01 50.07 0.0056785
CZ 1.01 49.36 0.0378501
EE 1.19 52.64 0.0728996
FI 1.43 50.38 0.0384973
FR 1.74 50.33 0.4134357
HR 1.00 50.09 0.0625712
HU 1.15 50.69 0.0641303
IE 0.88 48.60 0.2455963
LU 1.12 48.73 0.7070232
LV 1.08 53.17 0.0207036
MT 1.08 46.36 0.5777114
PL 0.94 49.68 0.0311985
PT 1.51 50.91 0.7596644
RO 0.98 49.18 0.0178487
SI 1.14 48.77 0.1981206
SK 0.76 49.54 0.0214372
ggplot(demog_df, aes(pct_ge85, case_rate_p_mill)) +
  geom_point() + 
  geom_smooth(formula = y ~ x, method = lm, se=FALSE) + 
  labs(x='Percent of the country that is 85 years and older',
       y='Case rate of country per 1,000,000',
       title='Monkeypox rates increase in countries with more people who are 85 and older') +
  theme_light()

ggplot(demog_df, aes(pct_female, case_rate_p_mill)) +
  geom_point() + 
  geom_smooth(formula = y ~ x, method=lm, se=FALSE) + 
    labs(x='Percent of the country that is female',
       y='Case rate of country per 1,000,000',
       title='Monkeypox rates decrease in countries with more females') +
  theme_light()